Data Science without the Data

Rhian Davies | @statsRhian

About Me 👋

  • Data Scientist at Jumping Rivers
  • RSS Statistical Ambassador
  • Bad at French (Je suis désolé 😳)

Cartoon of a woman holding out a book

About Jumping Rivers

  • Data science & machine learning
  • Training courses
  • Dashboard development and deployment
  • Infrastructure
  • Managed Posit services

Cartoon of three people working at computers

I’m going to tell you a story

The Client

  • Database of patients with a rare disease
  • Consulted us to perform the data analysis for a study
    • 200 statistical results (count, %, mean, sd, median, IQR)
    • Interrupted Time Series Analysis

A cartoon robot holding a test tube and wearing a lab coat

Stratifications

  • Country
  • Subtypes of the disease
  • Mobility
  • Drug
  • Year
  • Age of patient

For example

  • What is the average age of patients when they are diagnosed (by country and subtype)?

  • What percentage of patients are taking Drug A (by country, subtype and year)?

Simple, yes?

data |>
  group_by(country, subtype) |>
  summarise(mean = mean(age_at_diagnosis))

The challenge 🙈

  • Write a detailed Statistical Analysis Plan without seeing any data
  • Start development with a small subset of the data
  • We can’t see the data for Germany ever

Time to chat 💬

  • Have you experienced scenarios that have led to you having no data?

  • What problems did you encounter?

Our plan

The power of statistical summaries

  • For each dataset, calculate all the summaries we might need
  • Combine these summaries as we like
    • Mean: \(\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_{i}\)
    • Standard deviation: \(s = \sqrt{\frac{1}{N - 1} \left( \sum_{i=1}^{N} x_{i}^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_{i} \right)^2 \right)}\)
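The idea in the bullets above can be sketched in base R: each dataset shares only its count, sum and sum of squares, and the pooled mean and standard deviation are recovered from those totals. The function names `site_summary()` and `combine_summaries()` and the toy ages are illustrative assumptions, not part of the package described in this talk.

```r
# Per-dataset summaries: the only values that need to leave the site.
site_summary <- function(x) {
  x <- x[!is.na(x)]
  list(n = length(x), sum = sum(x), sum_sq = sum(x^2))
}

# Pool any number of per-dataset summaries into an overall mean and sd.
combine_summaries <- function(summaries) {
  total_of <- function(field) {
    sum(vapply(summaries, function(s) as.numeric(s[[field]]), numeric(1)))
  }
  n      <- total_of("n")
  total  <- total_of("sum")
  sum_sq <- total_of("sum_sq")
  list(
    n    = n,
    mean = total / n,
    sd   = sqrt((sum_sq - total^2 / n) / (n - 1))
  )
}

site_a <- c(34, 41, 29, 50)   # toy ages from a dataset we can see
site_b <- c(45, 38, 52)       # toy ages from a dataset summarised remotely

pooled <- combine_summaries(list(site_summary(site_a), site_summary(site_b)))
```

Here `pooled$mean` and `pooled$sd` agree with `mean()` and `sd()` applied to the combined vector. The sum-of-squares form can lose precision when values are large relative to their spread, so this is a sketch rather than a production recipe.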

A small cartoon robot stood next to a huge pile of data

Develop an R package

  • Run it on the data we can see
  • Send it to Marcus
  • He sends us an .RDS
  • We can aggregate and plot as needed

devtools::install_local("describeDisease.tar.gz")
library("describeDisease")
run_analysis("path/to/german.xlsx")
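A minimal sketch of the .RDS round trip, assuming the remote run saves a list of summaries rather than patient-level rows. The helper `write_site_summaries()` and the file name are hypothetical; the internals of `run_analysis()` are not shown in this talk.

```r
# Hypothetical stand-in for what a remote run might save: summaries only.
write_site_summaries <- function(ages, path) {
  summaries <- list(n = length(ages), sum = sum(ages), sum_sq = sum(ages^2))
  saveRDS(summaries, path)
  invisible(path)
}

rds_path <- file.path(tempdir(), "site_summaries.rds")
write_site_summaries(c(45, 38, 52), rds_path)   # toy ages

# On our side: read the file we were sent and work with the summaries.
received <- readRDS(rds_path)
received_mean <- received$sum / received$n
```

Because the file holds only totals, it can be aggregated with results from other datasets without the raw records ever being shared.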

Cartoon people holding wrapped presents

Where to develop?

  • Data security is important
  • Client wanted controlled access and logs
  • Shared projects
  • Multiple sessions

The Posit Workbench logo

Data exploration

  • What values are unique per patient?

  • Which stratifications are viable?

  • Quarto document for data exploration and validation
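The first question above, which values are unique per patient, can be checked with a few lines of base R. This is a sketch on toy data; the column names and the helper `is_constant_per_patient()` are assumptions for illustration.

```r
# Toy data: one row per patient visit (column names are assumptions).
visits <- data.frame(
  patient_id = c(1, 1, 2, 2, 3),
  country    = c("FR", "FR", "UK", "UK", "DE"),
  drug       = c("A", "B", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# TRUE if the column takes a single distinct value within every patient.
is_constant_per_patient <- function(data, column, id = "patient_id") {
  n_distinct <- tapply(data[[column]], data[[id]],
                       function(v) length(unique(v)))
  all(n_distinct == 1)
}

candidate_columns <- setdiff(names(visits), "patient_id")
constant <- vapply(candidate_columns,
                   function(col) is_constant_per_patient(visits, col),
                   logical(1))
```

In this toy example `country` is constant within each patient while `drug` is not, which hints at which columns can be treated as patient-level attributes.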

The Posit Workbench logo

Data validation packages 📦

What happened?

Sure, we’ll send you dummy data

Oh no

  • Real data shuffled
  • It was an XLSX worksheet

Cartoon figure saying 'Oh no'

Sure, we’ll send you the schema

Database schema for a single indicator listing allowed entries

Oh no

  • Data didn’t match the specification
  • Data types not defined

Cartoon figure saying 'Oh no'

Sure, we’ll send you validated data

Oh no

  • It wasn’t validated.
  • Patients with stop dates but no start dates
  • Patients with start & stop dates but with the drug name missing
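The two problems listed above can be expressed as simple checks. The sketch below uses base R on toy data; the column names are assumptions, not the client's schema.

```r
# Toy prescription records (column names are assumptions).
prescriptions <- data.frame(
  patient_id = 1:4,
  drug       = c("A", NA, "B", "A"),
  start_date = as.Date(c("2020-01-01", "2020-02-01", NA, "2021-06-01")),
  stop_date  = as.Date(c("2020-06-01", "2020-03-01", "2021-01-01", NA)),
  stringsAsFactors = FALSE
)

# Each check returns the patient ids that fail it.
stop_without_start <- with(
  prescriptions,
  patient_id[!is.na(stop_date) & is.na(start_date)]
)
dates_without_drug <- with(
  prescriptions,
  patient_id[!is.na(start_date) & !is.na(stop_date) & is.na(drug)]
)
```

Returning the failing ids, rather than a single TRUE or FALSE, gives the data owner something concrete to investigate.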

Whose responsibility is it?

Cartoon figure saying 'Oh no'

Okay, let’s run the analysis

Oh no

Hi Rhian, I have run the code, unfortunately I get the error you can see below.

Error in `purrr::map()`:
In index: 18.
Caused by error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
Column `time_axis` is not found.
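An error like this appears partway through a run, on a machine we cannot inspect. One way to surface it earlier is to check for the expected columns before any analysis starts. The helper `check_required_columns()` below is hypothetical and assumes the required column names are known up front.

```r
# Fail early, with a readable message, if expected columns are absent.
check_required_columns <- function(data, required) {
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ",
         paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Toy data without the `time_axis` column.
toy <- data.frame(country = "DE", subtype = "I")
outcome <- tryCatch(
  check_required_columns(toy, c("country", "subtype", "time_axis")),
  error = function(e) conditionMessage(e)
)
```

The message names every missing column at once, which saves a round of emails per column.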

Cartoon figure saying 'Oh no'

Generating results…

Oh no

#' eval: TRUE
wb = openxlsx::createWorkbook("Results")
openxlsx::addWorksheet(wb, "Analysis by country, subtype and drug")

Cartoon figure saying 'Oh no'
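The slide does not spell out what failed here. One plausible cause, offered as an assumption, is Excel's 31-character limit on worksheet names, which the name above exceeds. A base R check along these lines would catch it before the workbook is built; the shorter name is purely illustrative.

```r
sheet_name <- "Analysis by country, subtype and drug"
max_sheet_chars <- 31   # Excel's limit on worksheet name length

name_length <- nchar(sheet_name)
too_long <- name_length > max_sheet_chars

# One possible workaround (an illustration, not the fix used in the talk):
# a shorter tab name, with the full title written into the sheet itself.
short_name <- "Country-subtype-drug"
```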

Final run

Sure, I’ll run it right away and let you know!

Oh no

Unfortunately, I get the error below. The same error also appears when I only use the data that you already have, which is strange because I suppose that you have already tested this script on that data.

    Error in `dplyr::left_join()`:
    ! `...` must be empty.
    ✖ Problematic argument:
    • relationship = "many-to-many"

Cartoon figure saying 'Oh no'

{dplyr} version

  • We specified {dplyr} v1.1.0
  • We needed to specify {dplyr} v1.1.1
  • {renv} or Docker would have avoided this
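A version guard at the top of the analysis is one lightweight way to catch this kind of mismatch. The sketch below is an assumption about how that could look, with `has_min_version()` as a hypothetical helper; {renv} or Docker remain the more thorough options.

```r
# TRUE if the installed version of `pkg` is at least `minimum`.
has_min_version <- function(pkg, minimum) {
  utils::packageVersion(pkg) >= package_version(minimum)
}

# Fail fast at the top of the analysis instead of deep inside a join.
# Left as a comment so the sketch runs without dplyr installed:
# stopifnot(has_min_version("dplyr", "1.1.1"))

# With {renv}, the lockfile records exact versions for the other side:
# renv::snapshot()   # on the development machine
# renv::restore()    # on the machine that runs the analysis

# Demonstrated on a package that ships with R.
utils_ok <- has_min_version("utils", "1.0.0")
```

In a package, the same constraint can also be declared in the DESCRIPTION file, for example `dplyr (>= 1.1.1)` under Imports.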

Diffify hex sticker: a red package symbol next to a green package symbol

Tada 🎉

Faceted ggplot graph showing points and standard deviation

In hindsight

  • Push back earlier to evidence the data challenges
  • Set realistic expectations
  • Use a proper database
  • purrr::map2() with tidyr::nest() was a helpful workflow
  • Use a different git workflow
  • Use {renv} from the start
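For readers unfamiliar with the purrr::map2() and tidyr::nest() workflow mentioned above, here is a minimal sketch on toy data. The column names and the helper `summarise_group()` are assumptions for illustration; the real analysis functions are not shown in this talk.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy patient data (column names are assumptions).
patients <- tibble(
  country          = c("A", "A", "B", "B", "B"),
  age_at_diagnosis = c(30, 40, 50, 60, 70)
)

# One function that summarises a single group's data.
summarise_group <- function(country, data) {
  tibble(
    country  = country,
    n        = nrow(data),
    mean_age = mean(data$age_at_diagnosis)
  )
}

# nest() gives one row per country with that country's rows in `data`;
# map2() then walks over the group label and its data together.
nested <- patients |> nest(data = -country)
results <- map2(nested$country, nested$data, summarise_group) |>
  list_rbind()
```

Each group is handled by the same small function, which can be tested on its own before it is run on data nobody on the team can see.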

Questions?